This document walks you through the steps of calculating design weights, running analyses, and understanding collapsing confidence intervals with increased datasets, despite temporal variability for the purpose of writing the Integrated Report chapter for the Probabilistic Monitoring Program due to EPA every two years. This document overviews the process taken each year up until the 2018 IR, illustrating the confidence in estimates with increasing n. The document uses the James River Basin and Big Sandy to juxstapose two disparate areas on the Virginia landscape to explain: why it took so long to report on smaller basins and what happens to confidence in estimates in big and small watersheds over time.
Starting at the same point of ‘HowToWrite2018IRProbMonChapter.Rmd,’ we are going to begin with the raw data and subset data from program inception (2001) to different IR cycle cut offs before running estimates on VSCI/VCPMI for the James and Big Sandy Basins. All trend weight adjustments will be run on these data subsets to maintain comparability of estimates.
Bring in appropriate datasets. Always start with VSCI (VSCI/VCPMI) first because this parameter will have the best information on whether or not to include the station for design weight calculations.
#### Read in master field data sampled at all sites
# Select the parameters we want to run
# rename a few parameters to work better in R
surveyData <- readxl::read_excel('originalData/Wadeable_ProbMon_2001-2016_JRH.xlsx',
sheet='ProbMonData2001-2016_EVJ') %>%
dplyr::select(DataSource,StationID,Year,StationID_Trend,LongitudeDD,LatitudeDD,stratum,designweight,
weightcategory,station,state,status,comment,set,Basin,SubBasin,BayShed,BayPanel,
EcoRegion,BioRegion,Panel,BioPanel,Order,BasinSize,StreamSizeCat,StreamSizeCatPhase,
AREA_SQ_MILES,IR2008,IR2010,IR2012,IR2014,IR2016,IR2018,designweighttrend,
designweightoriginal,filwgt_trend,filwgt_orgil,VSCIVCPMI) %>%
dplyr::rename(siteID=StationID_Trend)
And now bring in design status data (for VSCI/VCPMI). It is best to use this as the default design status dataset (and adjust TS to OT if not sampled) because it is the most complete version of the data.
We are going to recode the IR window fields such that each field will indicate whether a sample would have been able to be reported on in that cycle from 2001-cycle chosen. e.g. a site sampled in 2006 would be available for all IR windows but a site sampled in 2015 would only be available for 2018 IR.
designStatus <- readxl::read_excel('originalData/biology.xlsx',sheet='biology2018') %>%
mutate(
# recode for special purposes of this analysis, not actual IR window years
IR2008 = ifelse(Year <= 2006, 1, NA),
IR2010 = ifelse(Year <= 2008, 1, NA),
IR2012 = ifelse(Year <= 2010, 1, NA),
IR2014 = ifelse(Year <= 2012, 1, NA),
IR2016 = ifelse(Year <= 2014, 1, NA),
IR2018 = ifelse(Year <= 2016, 1, NA),
# resume other changes to get functions to work without too much recoding
Panel1 = ifelse(Panel == "Phase1", 1, NA),
Panel2 = ifelse(Panel == "Phase2", 1, NA),
BioPanel1 = ifelse(BioPanel == "Phase1", 1, NA),
BioPanel2 = ifelse(BioPanel == "Phase2", 1, NA),
BioPanel3 = ifelse(BioPanel == "Phase3", 1, NA),
BioPanel4 = ifelse(BioPanel == "Phase4", 1, NA)) %>% # housekeeping: recode biophase and change 1's to NA's to indicate it wasnt sampled in that window, break up Panel and BioPanel windows to separate columns for weight adjustment purposes
dplyr::rename(siteID=sampleID ) # start playing nicely with spsurvey)
These are the data windows we are looking at for this analysis:
The following functions run the CDF analyses for VSCI/VCPMI from the surveyData dataframe. The subpopEstimate() function runs each subpopulation and nests inside the listOfResults(), which outputs a list of all subpopulation results. The allCDFresults() adjusts all weights according to sample window and then calls the listOfResults() to run all the CDF, percentile, and population estimates for each subpopulation. See below for how to run each of these functions and how to view outputs. *These functions are reduced version of their counterparts in ‘HowToWrite2018IRProbMonChapter.Rmd’ to speed analyses to just the question at hand.
# Edited 7/17/2019 to output column name to make extracting data from map process easier with purrr
vlookupEVJ2 <- function(table, #the table where you want to look for it; will look in first column
ref, #the value or values that you want to look for
column, #the column that you want the return data to come from,
range=FALSE, #if there is not an exact match, return the closest?
larger=FALSE) #if doing a range lookup, should the smaller or larger key be used?)
{
if(!is.numeric(column) & !column %in% colnames(table)) {
stop(paste("can't find column",column,"in table"))
}
if(range) {
if(!is.numeric(table[,1])) {
stop(paste("The first column of table must be numeric when using range lookup"))
}
table <- table[order(table[,1]),]
index <- findInterval(ref,table[,1])
if(larger) {
index <- ifelse(ref %in% table[,1],index,index+1)
}
output <- table[index,column]
output[!index <= dim(table)[1]] <- NA
names(output) <- names(table[column])
} else {
output <- table[match(ref,table[,1]),column]
output[!ref %in% table[,1]] <- NA #not needed?
names(output) <- names(table[column])
}
#dim(output) <- dim(ref)
output
}
## Subpopulation estimates by parameter
subpopEstimate <- function(finalData,parameter, subpopulationCategory, subpopulation, altName, specialWeight){
# Build in catch in case whole population needed
if(is.na(subpopulationCategory) & subpopulation == "Virginia"){
subpopData <- finalData
sites.ext <- select(subpopData, siteID) %>% mutate(Use = TRUE)
subpop.ext <- select(subpopData,siteID) %>% mutate(Region = 'Virginia')
}else{
subpopData <- finalData[ finalData[[subpopulationCategory]] == subpopulation, ]
if(nrow(subpopData) == nrow(finalData)){ # special catch for IR windows not filtering correctly
subpopData <- finalData[ !is.na(finalData[[subpopulationCategory]] ), ]
}}
# If no data in subpopulation, keep moving
if(nrow(subpopData) !=0){
sites.ext <- select(subpopData, siteID) %>% mutate(Use = TRUE)
# Special Cases to match existing terminology for each Subpopulation
if(is.na(altName)){
subpop.ext <- select(subpopData,siteID) %>% mutate(Region = subpopulation)
}else{
subpop.ext <- select(subpopData,siteID) %>% mutate(Region = altName)
}
# Choose correct final weight and filter to stratum = 1
if(specialWeight==FALSE){
finalweight <- subpopData$finalweight_all
}else{
finalweight <- as.numeric(as.matrix(subpopData[,specialWeight]))
}
design.ext <- mutate(subpopData,siteID = siteID, stratum = "1", wgt = finalweight, xcoord = xmarinus, ycoord = ymarinus) %>%
select(siteID, stratum, wgt, xcoord, ycoord)
data.cont.ext <- select(finalData, siteID, parameter)
subpopStatus <- cont.analysis(sites = data.frame(sites.ext), subpop = data.frame(subpop.ext), design = data.frame(design.ext),
data.cont = data.frame(data.cont.ext),
pctval=c(1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99),
conf=95, vartype="Local")
return(subpopStatus)}else{return(list())}
}
# Now build it all into a single function that returns a list of lists (one list object per estimate,
# each estimate is a list of of 4 dataframes where the important ones ($CDF and $Pct) can easily be
# reached with list calls noted at bottom of page)
listOfResults <- function(popstatus.est, finalData, parameterName, specialWeight){
list(
popstatus.est = popstatus.est,
dataAnalyzed = finalData,
# By Basin
estimateJames = subpopEstimate(finalData,parameterName, 'Basin', 'James','James Basin', specialWeight=specialWeight) ,
estimateBigSandy = subpopEstimate(finalData,parameterName, 'SubBasin', 'Big Sandy',NA, specialWeight=specialWeight)
)
}
allCDFresults <- function(designStatus, surveyData, parameter, specialWeight){
# Organize new parameter data
parameterData <- data.frame(select(surveyData,siteID,parameter))
parameterData[,2] <- as.numeric(as.character(parameterData[,2])) # change to numeric in case it was saved as anything else
# Change status if it wasnt sampled
initialWeightData <- left_join(designStatus,parameterData,by='siteID') %>%
mutate(parameter_status=ifelse(status == "TS" & is.na(!!as.name(parameter)),"OT", # if TS but not sampled, change to OT
ifelse(status == "OT" & is.na(!!as.name(parameter)),"OT", # if OT but not sampled, keep OT
ifelse(status == "PD", "PD", # if PD, keep it
ifelse(status == 'NT','NT', # if NT, keep it
ifelse(status == "TS",'TS', NA))))), # if TS AND sampled, keep as TS
designweightoriginal = as.factor(`strahler order`), # easier to change as factor
designweightoriginal = dplyr::recode(designweightoriginal,`1`="3790.5165999999999",
`2`="947.62919999999997",
`3`="541.50239999999997",
`4`="315.87639999999999",
`5`="140.3895",
`6`="140.3895")) # overwrite all design weights back to original
# Adjust initial weights for trend stations across various data windows
# First calculate number of times sampled in each window
trendWeightAdjustments <- mutate(initialWeightData,
designweightoriginal = as.numeric(as.character(designweightoriginal)), #factor to numeric
siteIDoriginal=gsub("_.*$", "", siteID)) %>% # get rid of concatenated year for trends to make calculations easier
# Full window
group_by(siteIDoriginal) %>%
mutate(nYearsSampled = ifelse(is.na(siteIDoriginal),NA,n())) %>%
ungroup()%>%
# 2018 IR
group_by(siteIDoriginal, IR2018) %>%
mutate(nYearsSampled_IR2018 = ifelse(is.na(IR2018),NA,n())) %>%
ungroup() %>%
# 2016 IR
group_by(siteIDoriginal, IR2016) %>%
mutate(nYearsSampled_IR2016 = ifelse(is.na(IR2016),NA,n())) %>%
ungroup()%>%
# 2014 IR
group_by(siteIDoriginal, IR2014) %>%
mutate(nYearsSampled_IR2014 = ifelse(is.na(IR2014),NA,n()))%>%
ungroup()%>%
# 2012 IR
group_by(siteIDoriginal, IR2012) %>%
mutate(nYearsSampled_IR2012 = ifelse(is.na(IR2012),NA,n())) %>%
ungroup()%>%
# 2010 IR
group_by(siteIDoriginal, IR2010) %>%
mutate(nYearsSampled_IR2010 = ifelse(is.na(IR2010),NA,n())) %>%
ungroup()%>%
# 2008 IR
group_by(siteIDoriginal, IR2008) %>%
mutate(nYearsSampled_IR2008 = ifelse(is.na(IR2008),NA,n())) %>%
ungroup() %>%
# Divide out weight by years sampled in each window
# if the site wasnt sampled in a window, give it the original weight but it will be adjusted according to an OT status later
mutate(designweight_all= designweightoriginal/nYearsSampled,
designweight_IR2018= ifelse(is.na(nYearsSampled_IR2018),designweightoriginal,designweightoriginal/nYearsSampled_IR2018),
designweight_IR2016= ifelse(is.na(nYearsSampled_IR2016),designweightoriginal,designweightoriginal/nYearsSampled_IR2016),
designweight_IR2014= ifelse(is.na(nYearsSampled_IR2014),designweightoriginal,designweightoriginal/nYearsSampled_IR2014),
designweight_IR2012= ifelse(is.na(nYearsSampled_IR2012),designweightoriginal,designweightoriginal/nYearsSampled_IR2012),
designweight_IR2010= ifelse(is.na(nYearsSampled_IR2010),designweightoriginal,designweightoriginal/nYearsSampled_IR2010),
designweight_IR2008= ifelse(is.na(nYearsSampled_IR2008),designweightoriginal,designweightoriginal/nYearsSampled_IR2008))
## Adjust design weights to get final weights
# Initial sample frame inputs
# List stream order by kilometer it represents
sframe <- c('1st'=51210, '2nd'=13680, '3rd'=7781.08, '4th'=4448.257,
'5th'=1731.302, '6th'=163.901, '7th'=14.7099 )
# recode to factor to make sframe match up to stream order
trendWeightAdjustments$`strahler order` <- as.factor(trendWeightAdjustments$`strahler order`)
levels(trendWeightAdjustments$`strahler order`) <- c('1st','2nd','3rd','4th','5th','6th','7th')
finalWeights <- select(trendWeightAdjustments,siteID:IR2018,siteIDoriginal,designweightoriginal,designweight_all:designweight_IR2008,
parameter_status,!!as.name(parameter))
finalWeights$finalweight_Year <- adjwgt(rep(TRUE,length(finalWeights$designweightoriginal)),finalWeights$designweightoriginal,
finalWeights$`strahler order`, sframe)
finalWeights$finalweight_all <- adjwgt(rep(TRUE,length(finalWeights$designweight_all)),finalWeights$designweight_all,
finalWeights$`strahler order`, sframe)
finalWeights$finalweight_IR2018 <- adjwgt(rep(TRUE,length(finalWeights$designweight_IR2018)),finalWeights$designweight_IR2018,
finalWeights$`strahler order`, sframe)
finalWeights$finalweight_IR2016 <- adjwgt(rep(TRUE,length(finalWeights$designweight_IR2016)),finalWeights$designweight_IR2016,
finalWeights$`strahler order`, sframe)
finalWeights$finalweight_IR2014 <- adjwgt(rep(TRUE,length(finalWeights$designweight_IR2014)),finalWeights$designweight_IR2014,
finalWeights$`strahler order`, sframe)
finalWeights$finalweight_IR2012 <- adjwgt(rep(TRUE,length(finalWeights$designweight_IR2012)),finalWeights$designweight_IR2012,
finalWeights$`strahler order`, sframe)
finalWeights$finalweight_IR2010 <- adjwgt(rep(TRUE,length(finalWeights$designweight_IR2010)),finalWeights$designweight_IR2010,
finalWeights$`strahler order`, sframe)
finalWeights$finalweight_IR2008 <- adjwgt(rep(TRUE,length(finalWeights$designweight_IR2008)),finalWeights$designweight_IR2008,
finalWeights$`strahler order`, sframe)
# Change decimal degree coordinates to marinus for equal area projection (needed for spsurvey::cat.analysis())
marinus <- spsurvey::marinus(finalWeights$`Latitude-DD`,finalWeights$`Longitude-DD`)
finalWeights <- cbind(finalWeights,data.frame(xmarinus=marinus$x,ymarinus=marinus$y))
rm(marinus)
####Stream Extent and Status Estimate
siteExtent <- data.frame(siteID=finalWeights$siteID, Use=rep(TRUE,nrow(finalWeights)) )
subpopExtent <- data.frame(siteID=finalWeights$siteID,Region=rep('Virginia',nrow(finalWeights)) )
designExtent <- data.frame(siteID=finalWeights$siteID,stratum=rep(1,nrow(finalWeights)),
wgt=finalWeights$finalweight_all, xcoord=finalWeights$xmarinus,ycoord=finalWeights$ymarinus)
StatusTNT <- finalWeights$status
levels(StatusTNT) <- list(T=c('TS','PD','OT'), NT=c('NT') )
data.cat.ext <- data.frame(finalWeights[,c('siteID','status')],StatusTNT)
popstatus.est <- cat.analysis(sites = siteExtent, subpop = subpopExtent, design = designExtent,
data.cat = data.cat.ext, conf=95, vartype="Local")
dataAnalyzed <- filter(finalWeights, parameter_status=='TS') %>%
select(siteID,!!as.name(parameter),parameter_status,everything())
output <- listOfResults(popstatus.est, dataAnalyzed, parameter, specialWeight)
return(output)
}
Now we subset the original surveydata to reproduce what would happen if we went back in time and ran CDF curves on smaller datasets for each IR cycle. You will see the number of sites in the Big Sandy creep up slowly in comparison to the James River Basin.
surveyDataJBS <- filter(surveyData, SubBasin %in% c("Big Sandy", "James")) # 30 Big Sandy
surveyDataJBS_2008 <- filter(surveyDataJBS, Year <= 2006) # 10 Big Sandy
surveyDataJBS_2010 <- filter(surveyDataJBS, Year <= 2008)# 12 Big Sandy
surveyDataJBS_2012 <- filter(surveyDataJBS, Year <= 2010)# 17 Big Sandy
surveyDataJBS_2014 <- filter(surveyDataJBS, Year <= 2012)# 22 Big Sandy
surveyDataJBS_2016 <- filter(surveyDataJBS, Year <= 2014)# 27 Big Sandy
yearBasin <- tibble(Year= c(2008, 2010, 2012, 2014, 2016, 2018),
`n Big Sandy` = c(nrow(filter(surveyDataJBS_2008, SubBasin == 'Big Sandy')),
nrow(filter(surveyDataJBS_2010, SubBasin == 'Big Sandy')),
nrow(filter(surveyDataJBS_2012, SubBasin == 'Big Sandy')),
nrow(filter(surveyDataJBS_2014, SubBasin == 'Big Sandy')),
nrow(filter(surveyDataJBS_2016, SubBasin == 'Big Sandy')),
nrow(filter(surveyDataJBS, SubBasin == 'Big Sandy'))),
`n James` = c(nrow(filter(surveyDataJBS_2008, SubBasin == 'James')),
nrow(filter(surveyDataJBS_2010, SubBasin == 'James')),
nrow(filter(surveyDataJBS_2012, SubBasin == 'James')),
nrow(filter(surveyDataJBS_2014, SubBasin == 'James')),
nrow(filter(surveyDataJBS_2016, SubBasin == 'James')),
nrow(filter(surveyDataJBS, SubBasin == 'James'))))
datatable(yearBasin, rownames = FALSE, options = list(dom='t'))
The next chunk runs the population estimates for each subset of our full dataset. Note that the weights are adjusted according to the input data window.
#VSCIJBS_2008 <- allCDFresults(designStatus, surveyDataJBS_2008,'VSCIVCPMI','finalweight_IR2008')
#VSCIJBS_2010 <- allCDFresults(designStatus, surveyDataJBS_2010,'VSCIVCPMI','finalweight_IR2010')
#VSCIJBS_2012 <- allCDFresults(designStatus, surveyDataJBS_2012,'VSCIVCPMI','finalweight_IR2012')
#VSCIJBS_2014 <- allCDFresults(designStatus, surveyDataJBS_2014,'VSCIVCPMI','finalweight_IR2014')
#VSCIJBS_2016 <- allCDFresults(designStatus, surveyDataJBS_2016,'VSCIVCPMI','finalweight_IR2016')
#VSCIJBS_2018 <- allCDFresults(designStatus, surveyDataJBS,'VSCIVCPMI','finalweight_IR2018')
## save everything for faster rendering of .Rmd later
#saveRDS(VSCIJBS_2008, 'processedData/VSCIJBS_2008.RDS')
#saveRDS(VSCIJBS_2010, 'processedData/VSCIJBS_2010.RDS')
#saveRDS(VSCIJBS_2012, 'processedData/VSCIJBS_2012.RDS')
#saveRDS(VSCIJBS_2014, 'processedData/VSCIJBS_2014.RDS')
#saveRDS(VSCIJBS_2016, 'processedData/VSCIJBS_2016.RDS')
#saveRDS(VSCIJBS_2018, 'processedData/VSCIJBS_2018.RDS')
# bring it back in
VSCIJBS_2008 <- readRDS('processedData/VSCIJBS_2008.RDS')
VSCIJBS_2010 <- readRDS('processedData/VSCIJBS_2010.RDS')
VSCIJBS_2012 <- readRDS('processedData/VSCIJBS_2012.RDS')
VSCIJBS_2014 <- readRDS('processedData/VSCIJBS_2014.RDS')
VSCIJBS_2016 <- readRDS('processedData/VSCIJBS_2016.RDS')
VSCIJBS_2018 <- readRDS('processedData/VSCIJBS_2018.RDS')